AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition
Lin, Zichuan, Liu, Yicheng, Yang, Yang, Tao, Lvfang, Ye, Deheng
Vision-Language Models (VLMs) have achieved remarkable success in visual question answering tasks, but their reliance on large numbers of visual tokens introduces significant computational overhead. While existing efficient VLM approaches reduce visual tokens through fixed-ratio compression, they operate passively and lack the ability to adapt to varying task requirements. This motivates a fundamental question: Can VLMs autonomously determine the minimum number of visual tokens required for each sample? Inspired by human active vision mechanisms, we introduce AdaptVision, an efficient VLM paradigm that enables adaptive visual token acquisition through a coarse-to-fine approach. Our model initially processes compressed visual tokens from low-resolution images and selectively acquires additional visual information by invoking a bounding box tool to crop key regions when necessary. We train AdaptVision using a reinforcement learning framework that carefully balances accuracy and efficiency. Central to our approach is Decoupled Turn Policy Optimization (DTPO), which decouples the learning objective into two components: (1) tool learning, which optimizes correct tool utilization, and (2) accuracy improvement, which refines the generated responses to improve answer correctness. Based on this formulation, we further decouple advantage estimation by computing separate advantages for tokens associated with each objective. This formulation enables more effective optimization for AdaptVision compared to vanilla GRPO. Comprehensive experiments across multiple VQA benchmarks demonstrate that AdaptVision achieves superior performance while consuming substantially fewer visual tokens than state-of-the-art efficient VLM methods.
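The decoupled advantage estimation described above can be sketched as follows. This is an illustrative reconstruction assuming GRPO-style group-normalized advantages computed separately per objective; all function names are hypothetical and the paper's exact DTPO formulation may differ:

```python
import numpy as np

def group_advantages(rewards):
    """GRPO-style group normalization: (r - mean) / (std + eps)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def decoupled_advantages(tool_rewards, answer_rewards):
    """DTPO-inspired decoupling (illustrative): normalize the tool-use
    reward and the answer-correctness reward in separate groups, so that
    tool-call tokens and answer tokens each receive their own advantage."""
    return group_advantages(tool_rewards), group_advantages(answer_rewards)
```

Keeping the two reward streams in separate groups prevents a rollout's answer quality from washing out the signal for whether the bounding-box tool was invoked correctly, which is the point of the decoupled objective.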
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.70)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (0.46)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
Block Transformer: Global-to-Local Language Modeling for Fast Inference
We introduce the Block Transformer, which applies hierarchical global-to-local modeling to autoregressive transformers to mitigate the inference bottlenecks associated with self-attention. At every decoding step, self-attention must retrieve the key-value (KV) cache of the entire preceding sequence from memory to access context information, leading to two primary bottlenecks during batch inference. First, there is a significant delay in obtaining the first token, as the entire prompt must be processed to prefill the KV cache.
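The KV-cache bottleneck can be made concrete with a back-of-the-envelope size calculation. The model shapes below are illustrative (a typical 7B-class configuration), not taken from the paper:

```python
def kv_cache_bytes(batch, seq_len, n_layers, n_heads, head_dim, dtype_bytes=2):
    """Total KV-cache size: keys + values, for every layer, head, and
    token position, at dtype_bytes per element (2 for fp16/bf16)."""
    return 2 * batch * seq_len * n_layers * n_heads * head_dim * dtype_bytes

# Illustrative shapes: 32 layers, 32 heads, head_dim 128, 4k context, batch 8
size = kv_cache_bytes(batch=8, seq_len=4096, n_layers=32, n_heads=32, head_dim=128)
print(f"KV cache: {size / 2**30:.1f} GiB")  # 16.0 GiB
```

Since this entire cache must be streamed from memory at every decoding step, batch decoding quickly becomes memory-bandwidth bound, which is what global-to-local modeling aims to relieve.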
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.93)
- Overview (0.92)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Health & Medicine > Therapeutic Area > Neurology (1.00)
- Health & Medicine > Health Care Technology (0.77)
We sincerely thank all reviewers for the insightful comments and feedback on our work on learning from failure (LfF). We do not interpret this as a "true" trade-off, as debiasing does not degrade the model's […] Instead, we view the apparent underperformance as a result of "not utilizing a (delusional) spurious correlation." Following R1's suggestion, we additionally test ReBias [2] (SOTA among […]). This is also consistent with our claim that LfF is not "domain-specific". However, this consistency may not hold depending on the definition of "domain." Hence, we deeply resonate with R2's concern, and we will further clarify the type of knowledge used by LfF […] For example, we will modify L2-5 in the abstract to "In this work, we propose a new algorithm utilizing a […]" However, we only use LfF's yes/no type of knowledge for choosing one of the attributes as an undesired […] Following R2's suggestion, we further verify […] Our LfF combination rule achieves 74.01% […] We will add more discussions and experiments in the final draft.
Adaptive Residual-Update Steering for Low-Overhead Hallucination Mitigation in Large Vision Language Models
Zou, Zhengtao, Gao, Ya, Guan, Jiarui, Li, Bin, Marttinen, Pekka
Large Vision-Language Models (LVLMs) often suffer from object hallucination, generating text inconsistent with visual inputs, which can critically undermine their reliability. Existing inference-time interventions to mitigate this issue present a challenging trade-off: while methods that steer internal states or adjust output logits can be effective, they often incur substantial computational overhead, typically requiring extra forward passes. This efficiency bottleneck can limit their practicality for real-world, latency-sensitive deployments. In this work, we aim to address this trade-off with Residual-Update Directed DEcoding Regulation (RUDDER), a low-overhead framework that steers LVLMs towards visually-grounded generation. RUDDER is built on two key innovations: (1) Contextual Activation Residual Direction (CARD) vector, a per-sample visual evidence vector extracted from the residual update of a self-attention layer during a single, standard forward pass. Extensive experiments on key hallucination benchmarks, including POPE and CHAIR, indicate that RUDDER achieves performance comparable to state-of-the-art methods while introducing negligible computational latency, validating RUDDER as a pragmatic and effective approach for improving LVLMs' reliability without a significant compromise on efficiency. Code is available at https://anonymous.4open.science/r/ While Large Vision-Language Models (LVLMs) have shown remarkable capabilities in multimodal tasks and are increasingly deployed to assist with real-world problems (Alayrac et al., 2022; Liu et al., 2024a), their practical reliability is critically undermined by a persistent challenge: object hallucination. As shown in Figure 1, LVLMs frequently generate fluent, convincing text that is factually inconsistent with visual groundings, severely limiting their real-world utility and credibility (Ji et al., 2023).
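A minimal sketch of residual-update steering of the kind RUDDER describes, under our own simplifying assumption that the CARD-style direction is a unit-normalized mean of the self-attention residual update over visual token positions; all names here are illustrative, not the paper's API:

```python
import numpy as np

def card_direction(residual_update):
    """Illustrative CARD-style direction: unit-normalized mean of a
    self-attention layer's residual update over visual token positions,
    captured during one standard forward pass."""
    d = np.asarray(residual_update, float).mean(axis=0)
    return d / (np.linalg.norm(d) + 1e-8)

def steer(hidden, direction, alpha=0.5):
    """Nudge a hidden state along the direction. This is a single vector
    add, so it introduces no extra forward pass."""
    return np.asarray(hidden, float) + alpha * np.asarray(direction, float)
```

The design point is that the direction is extracted from activations the model computes anyway, which is what keeps the overhead negligible compared with logit-adjustment methods that need additional passes.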
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.94)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
- Education (0.68)
- Information Technology (0.46)
In-Context Learning Without Copying
Sahin, Kerem, Feucht, Sheridan, Belfki, Adam, Brinkmann, Jannik, Mueller, Aaron, Bau, David, Wendler, Chris
Induction heads are attention heads that perform inductive copying by matching patterns from earlier context and copying their continuations verbatim. As models develop induction heads, they often experience a sharp drop in training loss, a phenomenon cited as evidence that induction heads may serve as a prerequisite for more complex in-context learning (ICL) capabilities. In this work, we ask whether transformers can still acquire ICL capabilities when inductive copying is suppressed. We propose Hapax, a setting where we omit the loss contribution of any token that can be correctly predicted by induction heads. Despite a significant reduction in inductive copying, performance on abstractive ICL tasks (i.e., tasks where the answer is not contained in the input context) remains comparable and surpasses the vanilla model on 13 of 21 tasks, even though 31.7% of tokens are omitted from the loss. Furthermore, our model achieves lower loss values on token positions that cannot be predicted correctly by induction heads. Mechanistic analysis further shows that models trained with Hapax develop fewer and weaker induction heads but still preserve ICL capabilities. Taken together, our findings indicate that inductive copying is not essential for learning abstractive ICL mechanisms.
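The loss-masking idea in Hapax can be sketched as follows: drop the loss contribution of any token an induction head already predicts correctly. This is an illustrative reduction; the paper's criterion for "predictable by induction heads" is determined mechanistically, not given as a boolean mask:

```python
import numpy as np

def hapax_loss(token_losses, induction_correct):
    """Mean loss over tokens NOT correctly predicted by induction heads.

    token_losses: per-token cross-entropy values.
    induction_correct: boolean per-token flags (assumed given here)
    marking tokens an induction head would copy correctly.
    """
    losses = np.asarray(token_losses, float)
    keep = ~np.asarray(induction_correct, bool)
    if keep.sum() == 0:
        return 0.0  # every token was induction-predictable; nothing to train on
    return float(losses[keep].mean())
```

Masking rather than reweighting means the gradient carries no signal at copyable positions at all, which is what suppresses the incentive to form strong induction heads.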
A Novel XAI-Enhanced Quantum Adversarial Networks for Velocity Dispersion Modeling in MaNGA Galaxies
Narkedimilli, Sathwik, Kumar, N V Saran, H, Aswath Babu, Vanahalli, Manjunath K, M, Manish, Jain, Vinija, Chadha, Aman
In the ever-evolving landscape of astrophysics and machine learning, understanding the internal kinematics of galaxies remains a formidable challenge. Traditional techniques for modeling galaxy dynamics have offered valuable insights but are often limited by their inability to capture complex, non-linear relationships in high-dimensional data. Recent advances in quantum computing and explainable artificial intelligence (XAI) provide new avenues for addressing these challenges, paving the way for more sophisticated and interpretable models in astrophysical research [19] [20] [21]. Galaxy velocity dispersion is a critical parameter that underpins our understanding of the mass distribution, dynamical state, and evolutionary history of galaxies. By analyzing detailed stellar population and kinematic properties--such as morphological classification, effective radius, and gradients in stellar age and metallicity--the prediction of velocity dispersion becomes central to characterizing the intricate interplay between a galaxy's structure and its dynamic behavior. The MaNGA dataset, with its rich set of 11 features, offers a robust platform for exploring these phenomena and highlights the technical demands of achieving accurate predictions in this domain [1].
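Since the underlying task is regression of velocity dispersion from 11 tabular features, a plain least-squares baseline illustrates the setup. The data below is synthetic stand-in data, not the MaNGA catalog, and the linear model is only a reference point against which non-linear or quantum-enhanced models would be compared:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for 11 MaNGA-style features (mock data, not MaNGA)
X = rng.normal(size=(200, 11))
w_true = rng.normal(size=11)
y = X @ w_true + 0.01 * rng.normal(size=200)   # mock velocity dispersion target

# Ordinary least-squares baseline with an intercept column
A = np.c_[X, np.ones(len(X))]
w, *_ = np.linalg.lstsq(A, y, rcond=None)
rmse = float(np.sqrt(np.mean((A @ w - y) ** 2)))
```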
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
The Robustness of Differentiable Causal Discovery in Misspecified Scenarios
Yi, Huiyang, He, Yanyan, Chen, Duxin, Kang, Mingyu, Wang, He, Yu, Wenwu
Causal discovery aims to learn causal relationships between variables from observational data, making it a fundamental task in machine learning. However, causal discovery algorithms often rely on unverifiable causal assumptions, which are usually difficult to satisfy in real-world data, thereby limiting the broad application of causal discovery in practical scenarios. Inspired by these considerations, this work extensively benchmarks the empirical performance of various mainstream causal discovery algorithms, which assume i.i.d. data, under eight model assumption violations. Our experimental results show that differentiable causal discovery methods exhibit robustness under the metrics of Structural Hamming Distance and Structural Intervention Distance of the inferred graphs in commonly used challenging scenarios, except for scale variation. We also provide theoretical explanations for the performance of differentiable causal discovery methods. Finally, our work aims to comprehensively benchmark the performance of recent differentiable causal discovery methods under model assumption violations, to provide a standard for the reasonable evaluation of causal discovery, and to further promote its application in real-world scenarios.
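Structural Hamming Distance, one of the two evaluation metrics named above, can be computed as follows. This uses a common convention in which a missing or extra edge counts as one error and a reversed edge also counts as one error; exact conventions vary between implementations:

```python
import numpy as np

def shd(g_true, g_pred):
    """Structural Hamming Distance between directed adjacency matrices:
    missing/extra edges count once each, a reversed edge counts once."""
    a = np.asarray(g_true, bool)
    b = np.asarray(g_pred, bool)
    # Edges present in only one graph's skeleton (each counted twice symmetrically)
    skel_diff = (a | a.T) ^ (b | b.T)
    missing_or_extra = int(skel_diff.sum()) // 2
    # Edges present in both skeletons but oriented in opposite directions
    reversed_edges = int((a & b.T & ~b).sum())
    return missing_or_extra + reversed_edges
```

For example, for the true chain 0 -> 1 -> 2, predicting 0 -> 1 and 2 -> 1 gives an SHD of 1 (one reversed edge), while predicting the empty graph gives an SHD of 2.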